Performance Models for Data Transfers: A Case Study with Molecular Chemistry Kernels
With the increasing complexity of hardware, systems with multiple memory nodes
are ubiquitous in High Performance Computing (HPC). It is paramount to develop
strategies to overlap the data transfers between memory nodes with computations
in order to exploit the full potential of these systems. In this article, we
consider the problem of deciding the order of data transfers between two memory
nodes for a set of independent tasks with the objective of minimizing the
makespan. We prove that, with limited memory capacity, obtaining the optimal
order of data transfers is an NP-complete problem. We propose several heuristics
for this problem and characterize the situations in which each performs well. We
present an analysis of our heuristics on traces obtained by running two
molecular chemistry kernels, namely Hartree-Fock (HF) and Coupled Cluster
Single Double (CCSD), on 10 nodes of an HPC system. Our results show that some
of our heuristics achieve significant overlap for moderate memory capacities
and come very close to the lower bound on the makespan.
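The abstract does not spell out the heuristics, but the problem structure is worth a sketch: with unlimited memory, serially transferring each task's data and then computing on it is a classical two-machine flow shop, which Johnson's rule solves optimally; the NP-completeness result above concerns the limited-memory variant. The following minimal illustration uses invented task times and is not the authors' code.

```python
# Minimal sketch (not the authors' code): with unlimited memory, ordering
# serial data transfers that feed a single compute resource is a
# two-machine flow shop, solved optimally by Johnson's rule. The paper's
# NP-completeness result concerns the limited-memory variant.

def johnson_order(tasks):
    """tasks: list of (name, transfer_time, compute_time) tuples."""
    front = sorted((t for t in tasks if t[1] <= t[2]), key=lambda t: t[1])
    back = sorted((t for t in tasks if t[1] > t[2]), key=lambda t: -t[2])
    return front + back

def makespan(order):
    """Transfers are serialized; each compute starts once its data arrived."""
    transfer_done = compute_done = 0.0
    for _, transfer, compute in order:
        transfer_done += transfer
        compute_done = max(compute_done, transfer_done) + compute
    return compute_done

tasks = [("A", 2.0, 5.0), ("B", 4.0, 1.0), ("C", 1.0, 3.0)]
print(makespan(johnson_order(tasks)))  # 10.0 for this toy instance
```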
Improving Performance of Iterative Methods by Lossy Checkpointing
Iterative methods are a common approach to solving the large, sparse linear
systems that are fundamental operations in many modern scientific
simulations. When large-scale iterative methods run in parallel across many
ranks, they must periodically checkpoint their dynamic variables to guard
against unavoidable fail-stop errors, which demands fast I/O systems and
large storage space. Consequently, significantly reducing the
checkpointing overhead is critical to improving the overall performance of
iterative methods. Our contribution is fourfold. (1) We propose a novel lossy
checkpointing scheme that can significantly improve the checkpointing
performance of iterative methods by leveraging lossy compressors. (2) We
formulate a lossy checkpointing performance model and theoretically derive an
upper bound on the extra number of iterations caused by the distortion of data
in lossy checkpoints, in order to guarantee a performance improvement under
the lossy checkpointing scheme. (3) We analyze the impact of lossy
checkpointing (i.e., the extra iterations caused by lossy checkpoint files)
for multiple types of iterative methods. (4) We evaluate the lossy
checkpointing scheme with optimal checkpointing intervals on a high-performance
computing environment with 2,048 cores, using the well-known scientific
computation package PETSc and a state-of-the-art checkpoint/restart toolkit.
Experiments show that our optimized lossy checkpointing scheme can
significantly reduce the fault tolerance overhead for iterative methods by
23% to 70% compared with traditional checkpointing and by 20% to 58% compared
with lossless-compressed checkpointing, in the presence of system failures.
Scheduling Data Flow Programs in XKaapi: A New Affinity-Based Algorithm for Heterogeneous Architectures
Efficient implementations of parallel applications on heterogeneous hybrid
architectures require a careful balance between computations and communications
with accelerator devices. Even if most of the communication time can be
overlapped by computations, it is essential to reduce the total volume of
communicated data. The literature therefore abounds with ad hoc methods to
reach that balance, but these are architecture- and application-dependent. We
propose here a generic mechanism to automatically optimize the scheduling
between CPUs and GPUs, and compare two strategies within this mechanism: the
classical Heterogeneous Earliest Finish Time (HEFT) algorithm and our new,
parametrized, Distributed Affinity Dual Approximation algorithm (DADA), which
consists of grouping the tasks by affinity before running a fast dual
approximation. We ran experiments on a heterogeneous parallel machine with six
CPU cores and eight NVIDIA Fermi GPUs. Three standard dense linear algebra
kernels from the PLASMA library were ported on top of the XKaapi runtime, and
we report their performance. The results show that both HEFT and DADA perform
well under various experimental conditions, but DADA performs better for larger
systems and numbers of GPUs and, in most cases, generates far lower
data-transfer volumes than HEFT to achieve the same performance.
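HEFT itself is standard: prioritize tasks (by upward rank along the task graph) and greedily place each on the processor giving the earliest finish time. The sketch below strips this to independent tasks on a CPU/GPU mix; the task costs and names are invented for illustration, and DADA's affinity grouping and dual approximation are not shown.

```python
# Sketch of the earliest-finish-time rule at the heart of HEFT, reduced to
# independent tasks (no task graph, no transfer costs); illustrative only.

def eft_schedule(costs, n_cpu, n_gpu):
    """costs: list of (cpu_time, gpu_time) per task."""
    ready = [0.0] * (n_cpu + n_gpu)            # next free time per processor
    is_gpu = [False] * n_cpu + [True] * n_gpu
    schedule = []
    # Longest tasks first, mirroring HEFT's rank-based priority.
    for tid, (tc, tg) in sorted(enumerate(costs), key=lambda x: -min(x[1])):
        finish = [ready[p] + (tg if is_gpu[p] else tc) for p in range(len(ready))]
        p = min(range(len(ready)), key=finish.__getitem__)
        ready[p] = finish[p]
        schedule.append((tid, "GPU" if is_gpu[p] else "CPU", finish[p]))
    return schedule, max(ready)

sched, total = eft_schedule([(10, 2), (8, 3), (6, 6), (4, 9)], n_cpu=2, n_gpu=1)
print(total)  # makespan of the toy instance
```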
SWIFT: Using task-based parallelism, fully asynchronous communication, and graph partition-based domain decomposition for strong scaling on more than 100,000 cores
We present a new open-source cosmological code, called SWIFT, designed to solve the equations of hydrodynamics using a particle-based approach (Smoothed Particle Hydrodynamics) on hybrid shared/distributed-memory architectures. SWIFT was designed from the bottom up to provide excellent strong scaling on both commodity clusters (Tier-2 systems) and Top100 supercomputers (Tier-0 systems), without relying on architecture-specific features or specialized accelerator hardware. This performance is due to three main computational approaches:
• Task-based parallelism for shared-memory parallelism, which provides fine-grained load balancing and thus strong scaling on large numbers of cores.
• Graph-based domain decomposition, which uses the task graph to decompose the simulation domain such that the work, as opposed to just the data (as is the case with most partitioning schemes), is equally distributed across all nodes.
• Fully dynamic and asynchronous communication, in which communication is modelled as just another task in the task-based scheme, sending data whenever it is ready and deferring tasks that rely on data from other nodes until it arrives.
In order to use these approaches, the code had to be rewritten from scratch and the algorithms therein adapted to the task-based paradigm. As a result, we can show upwards of 60% parallel efficiency for moderate-sized problems when increasing the number of cores 512-fold, on both x86-based and Power8-based architectures.
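The key idea in the third bullet, treating communication as just another task, can be illustrated with a toy dependency-counting scheduler. This is a sketch only; SWIFT itself is written in C and uses MPI.

```python
# Toy sketch (not SWIFT's actual C implementation) of "communication is
# just another task": a receive task unlocks its dependents exactly like
# a compute task, so no rank ever blocks waiting for data.

from collections import deque

class Task:
    def __init__(self, name, deps=()):
        self.name = name
        self.unresolved = len(deps)      # dependencies not yet completed
        self.dependents = []
        for d in deps:
            d.dependents.append(self)

def run(tasks):
    ready = deque(t for t in tasks if t.unresolved == 0)
    while ready:
        t = ready.popleft()
        print("run", t.name)             # could be compute, send, or recv
        for d in t.dependents:
            d.unresolved -= 1
            if d.unresolved == 0:        # all inputs ready: schedule it
                ready.append(d)

recv = Task("recv_halo")                 # asynchronous communication task
dens = Task("density_local")             # local compute task
force = Task("force", deps=[recv, dens]) # deferred until both complete
run([recv, dens, force])
```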
Large non-Gaussian Halo Bias from Single Field Inflation
We calculate Large Scale Structure observables for non-Gaussianity arising
from non-Bunch-Davies initial states in single field inflation. These scenarios
can have substantial primordial non-Gaussianity from squeezed (but observable)
momentum configurations. They generate a term in the halo bias that may be more
strongly scale-dependent than the contribution from the local ansatz. We also
discuss theoretical considerations required to generate an observable
signature.
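For reference, the local-ansatz contribution that the abstract compares against is the well-known scale-dependent bias correction (a standard result, not the paper's non-Bunch-Davies term), where T(k) is the transfer function and D(z) the linear growth factor:

```latex
% Standard local-ansatz scale-dependent halo bias (for comparison only):
\Delta b(k) = \frac{3 f_{\mathrm{NL}}\,(b_1 - 1)\,\delta_c\,\Omega_m H_0^2}{c^2\, k^2\, T(k)\, D(z)}
```

The 1/k² scaling here is the benchmark against which the non-Bunch-Davies term is said to be "more strongly scale-dependent".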
Loop Quantum Gravity and the Planck Regime of Cosmology
The very early universe provides the best arena we currently have to test
quantum gravity theories. The success of the inflationary paradigm in
accounting for the observed inhomogeneities in the cosmic microwave background
already illustrates this point to a certain extent because the paradigm is
based on quantum field theory on the curved cosmological space-times. However,
this analysis excludes the Planck era because the background space-time
satisfies Einstein's equations all the way back to the big bang singularity.
Using techniques from loop quantum gravity, the paradigm has now been extended
to a self-consistent theory from the Planck regime to the onset of inflation,
covering some 11 orders of magnitude in curvature. In addition, for a narrow
window of initial conditions, there are departures from the standard paradigm,
with novel effects, such as a modification of the consistency relation
involving the scalar and tensor power spectra and a new source for
non-Gaussianities. Thus, the genesis of the large scale structure of the
universe can be traced back to quantum gravity fluctuations \emph{in the Planck
regime}. This report provides a bird's eye view of these developments for the
general relativity community.
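For context, the standard single-field consistency relation whose modification is referenced above links the tensor-to-scalar ratio to the tensor spectral tilt:

```latex
% Standard inflationary consistency relation (the relation said to be
% modified in the Planck-regime extension):
r \equiv \frac{\mathcal{P}_t(k)}{\mathcal{P}_s(k)} = -8\, n_t
```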
Global Time Distribution via Satellite-Based Sources of Entangled Photons
We propose a satellite-based scheme to perform clock synchronization between
ground stations spread across the globe using quantum resources. We refer to
this as a quantum clock synchronization (QCS) network. Through detailed
numerical simulations, we assess the feasibility and capabilities of a
near-term implementation of this scheme. We consider a small constellation of
nanosatellites equipped only with modest resources. These include quantum
devices such as spontaneous parametric down conversion (SPDC) sources,
avalanche photo-detectors (APDs), and moderately stable on-board clocks such as
chip scale atomic clocks (CSACs). In our simulations, the various performance
parameters describing the hardware have been chosen such that they are either
already commercially available or require only moderate advances. We conclude
that such a scheme could feasibly establish a global network of ground-based
clocks synchronized to sub-nanosecond precision (down to a few picoseconds).
Such QCS satellite constellations would form the
infrastructure for a future quantum network, able to serve as a globally
accessible entanglement resource. At the same time, our clock synchronization
protocol provides the sub-nanosecond synchronization required by many quantum
networking protocols and can thus be seen as adding an extra layer of utility
to space-based quantum technologies designed for other purposes.
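A common way such schemes recover the clock offset is by cross-correlating the two stations' photon-arrival records: entangled pairs produce tightly time-correlated detections, so the offset appears as a coincidence peak. The following toy simulation is a sketch under invented parameters, not the paper's protocol or numbers.

```python
import numpy as np
from scipy.signal import correlate

# Toy sketch (invented parameters): recover a clock offset between two
# stations from entangled-pair detection timestamps via the peak of the
# cross-correlation of their time-of-arrival histograms.

rng = np.random.default_rng(0)
true_offset = 37e-9                               # assumed 37 ns clock offset
emit = rng.uniform(0.0, 1e-6, 2000)               # pair emissions, 1 us window
jitter = 50e-12                                   # assumed 50 ps detector jitter
t_a = emit + rng.normal(0.0, jitter, emit.size)               # station A
t_b = emit + true_offset + rng.normal(0.0, jitter, emit.size) # station B

bin_w = 10e-12                                    # 10 ps histogram bins
edges = np.arange(0.0, 1.1e-6, bin_w)             # headroom for the offset
h_a, _ = np.histogram(t_a, edges)
h_b, _ = np.histogram(t_b, edges)
xc = correlate(h_b, h_a, mode="full", method="fft")
offset = (np.argmax(xc) - (h_a.size - 1)) * bin_w
print(f"estimated offset: {offset * 1e9:.2f} ns")  # close to 37 ns
```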
Nanoscale piezoelectric response across a single antiparallel ferroelectric domain wall
Surprising asymmetry in the local electromechanical response across a single
antiparallel ferroelectric domain wall is reported. Piezoelectric force
microscopy is used to investigate both the in-plane and out-of-plane
electromechanical signals around domain walls in congruent and
near-stoichiometric lithium niobate. The observed asymmetry is shown to have a
strong correlation with crystal stoichiometry, suggesting defect-domain-wall
interactions. A defect-dipole model is proposed. The finite element method is used
to simulate the electromechanical processes at the wall and reconstruct the
images. For the near-stoichiometric composition, good agreement is found in
both form and magnitude. Some discrepancy remains between the experimental and
modeled widths of the imaged effects across a wall. This is analyzed from the
perspective of possible electrostatic contributions to the imaging process, as
well as local changes in the material properties in the vicinity of the wall.
Remarks on the renormalization of primordial cosmological perturbations
We briefly review the need to perform renormalization of inflationary
perturbations to properly work out the physical power spectra. We also
summarize the basis of (momentum-space) renormalization in curved spacetime and
address several misconceptions found in recent literature on this subject.